HDDS-14669. Implement a new async finalize command which does not block on the server#10152

Merged
sodonnel merged 14 commits into apache:HDDS-14496-zdu from sodonnel:HDDS-14669
Apr 30, 2026
@sodonnel
Contributor

What changes were proposed in this pull request?

The original SCM finalize command blocked until SCM and enough datanodes had finalized. In the new design, the finalize command kicks off the finalize process and returns immediately. Any process that needs to track progress must call the finalize status command to check whether finalization has completed.

This change adds a new protobuf message which triggers finalize and returns. There is a new "ozone admin upgrade finalize" command which triggers finalize.

The existing tests are adjusted so they call the new command and then poll the status command to see if finalize has completed or not.
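
The trigger-then-poll flow described above could be sketched roughly as follows. This is a minimal illustration, not code from this patch: the class `FinalizePollSketch`, the method `finalizeAndWait`, and its parameters are hypothetical names, and the trigger and status check are passed in as plain callbacks rather than real SCM client calls.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: trigger finalization once, then poll a status check. */
class FinalizePollSketch {

  /**
   * Runs the non-blocking trigger, then polls isFinalized until it reports
   * true or maxPolls checks have been made. Returns whether finalization
   * was observed as complete.
   */
  static boolean finalizeAndWait(Runnable triggerFinalize,
      BooleanSupplier isFinalized, int maxPolls, long pollIntervalMillis) {
    // Assumed to return immediately, like the new async finalize command.
    triggerFinalize.run();
    for (int i = 0; i < maxPolls; i++) {
      if (isFinalized.getAsBoolean()) {
        return true;
      }
      try {
        TimeUnit.MILLISECONDS.sleep(pollIntervalMillis);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return false;
  }
}
```

A caller that gives up after the last poll has not cancelled anything: in this sketch the trigger has already run, so a `false` result only means completion was not observed within the polling budget.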

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14669

How was this patch tested?

Existing integration tests modified to call the new flow.

Added a simple test to validate the new CLI command.

@github-actions github-actions Bot added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Apr 28, 2026
@errose28 errose28 requested review from dombizita and errose28 April 28, 2026 13:04
Contributor

@errose28 errose28 left a comment

Overall looks good, just left some minor comments. The UpgradeFinalizer parts of this change will be replaced as part of HDDS-15129 but this is a good starting point for the switch.

    if (status == FINALIZATION_REQUIRED) {
      finalizationExecutor.execute(service, this);
    }
  } catch (NotLeaderException e) {
Contributor

This should automatically be propagated back to the client without any extra handling required.

Contributor Author

I guess it won't propagate back with the catch block in place, so I think we should re-throw after the log?

Contributor

I don't think the catch or log is necessary. This is now a single ratis request so it works the same as others like close container, close pipeline, etc. If the contacted node is not the leader the finalization will not happen at all and the client's failover proxy should retry on the leader automatically.
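
To illustrate the point about propagation, here is a toy sketch under stated assumptions. It does not use the real Ratis or Ozone classes: `NotLeaderSketchException`, `Node`, and `callWithFailover` are hypothetical stand-ins, and the failover loop is a simplification of what a client failover proxy does.

```java
import java.util.List;

/** Hypothetical sketch: why swallowing a not-leader error defeats failover. */
class NotLeaderPropagationSketch {

  /** Stand-in for a "not leader" error; not the real Ratis exception. */
  static class NotLeaderSketchException extends RuntimeException {
    NotLeaderSketchException(String message) {
      super(message);
    }
  }

  /** Hypothetical server node that only finalizes when it is the leader. */
  static class Node {
    final boolean leader;
    int finalizeCount = 0;

    Node(boolean leader) {
      this.leader = leader;
    }

    /** Lets the not-leader error propagate to the caller. */
    void finalizeUpgrade() {
      if (!leader) {
        throw new NotLeaderSketchException("not the leader");
      }
      finalizeCount++;
    }

    /** Catches the error (as a log-only catch block would) and drops it. */
    void finalizeUpgradeSwallowing() {
      try {
        finalizeUpgrade();
      } catch (NotLeaderSketchException e) {
        // Logged and dropped: the caller sees a normal return.
      }
    }
  }

  /** Simplified failover: try each node until one returns without error. */
  static Node callWithFailover(List<Node> nodes, boolean swallow) {
    for (Node node : nodes) {
      try {
        if (swallow) {
          node.finalizeUpgradeSwallowing();
        } else {
          node.finalizeUpgrade();
        }
        return node; // The client believes this node handled the request.
      } catch (NotLeaderSketchException e) {
        // Fall through and retry on the next node.
      }
    }
    throw new IllegalStateException("no leader found");
  }
}
```

In this sketch, with a follower listed first and the leader second, the propagating variant ends up on the leader and finalizes once, while the swallowing variant returns from the follower and the leader is never asked to finalize.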

Contributor

@errose28 errose28 left a comment

The only functional comment I have left is on the NotLeaderException handling. Everything else LGTM.

@sodonnel sodonnel merged commit 4ab7a96 into apache:HDDS-14496-zdu Apr 30, 2026
83 of 85 checks passed